This analysis examines what brings people to hospitals and which diagnosis codes affect them most often.
## X.1 X Year NewBorn
## Min. : 1 Min. : 1 Min. :2001 New Born : 177289
## 1st Qu.: 411542 1st Qu.: 97706 1st Qu.:2002 Not NewBorn:1468876
## Median : 823083 Median :180014 Median :2003
## Mean : 823083 Mean :179415 Mean :2003
## 3rd Qu.:1234624 3rd Qu.:262322 3rd Qu.:2004
## Max. :1646165 Max. :375372 Max. :2005
##
## UnitsofAge Age Sex
## Min. :1.000 Min. : 0.00 Female:971461
## 1st Qu.:1.000 1st Qu.:24.00 Male :674704
## Median :1.000 Median :48.00
## Mean :1.248 Mean :45.77
## 3rd Qu.:1.000 3rd Qu.:71.00
## Max. :3.000 Max. :99.00
##
## Race MaritalStatus
## White :864820 Divorced : 34477
## Black :230778 Married :237782
## Other : 71794 Separated: 6187
## Asian : 11519 Single :291457
## American Indian/Alaskan Native: 5260 Widowed : 77999
## (Other) : 2722 NA's :998263
## NA's :459272
## DischargeMonth DischargeStatus
## Min. : 1.000 Alive, disposition not stated : 76773
## 1st Qu.: 3.000 Dead : 27043
## Median : 6.000 Medical Advice : 12079
## Mean : 6.429 Routine :1050035
## 3rd Qu.: 9.000 transferred to long-term facility : 102889
## Max. :12.000 transferred to short-term facility: 35621
## NA's : 341725
## DaysofCare LengthofStay X.GeoLocation
## Min. : 1.000 Min. :0.0000 MidWest :472279
## 1st Qu.: 2.000 1st Qu.:1.0000 NorthEast:361134
## Median : 3.000 Median :1.0000 South :588806
## Mean : 4.746 Mean :0.9827 West :223946
## 3rd Qu.: 5.000 3rd Qu.:1.0000
## Max. :561.000 Max. :1.0000
##
## HospitalType Diagnosis.Code.1
## Charity :1359992 Deliver-single liveborn : 159528
## Government : 142457 Single lb in-hosp w/o cs: 124939
## Proprietary: 143716 Single lb in-hosp w cs : 46262
## Pneumonia, organism NOS : 43554
## CHF NOS : 42278
## (Other) :1126583
## NA's : 103021
## Diagnosis.Code.2 Diagnosis.Code.3
## Hypertension NOS : 45332 Hypertension NOS : 77979
## CHF NOS : 42764 CHF NOS : 32627
## Atrial fibrillation : 34973 DMII wo cmp nt st uncntr: 27214
## Chr airway obstruct NEC : 29821 Atrial fibrillation : 26943
## Urin tract infection NOS: 26172 Chr airway obstruct NEC : 23292
## (Other) :1175380 (Other) :971534
## NA's : 291723 NA's :486576
## Diagnosis.Code.4 Diagnosis.Code.5
## Hypertension NOS : 76975 Hypertension NOS : 63944
## DMII wo cmp nt st uncntr: 29290 DMII wo cmp nt st uncntr: 25731
## CHF NOS : 19004 Tobacco use disorder : 17524
## Tobacco use disorder : 18649 Crnry athrscl natve vssl: 17477
## Crnry athrscl natve vssl: 18450 Hyperlipidemia NEC/NOS : 16441
## (Other) :827810 (Other) :690579
## NA's :655987 NA's :814469
## Diagnosis.Code.6 Diagnosis.Code.7
## Hypertension NOS : 47110 Hypertension NOS : 33098
## DMII wo cmp nt st uncntr: 19328 DMII wo cmp nt st uncntr: 13997
## Crnry athrscl natve vssl: 14769 Crnry athrscl natve vssl: 11798
## Tobacco use disorder : 14498 Hyperlipidemia NEC/NOS : 11638
## Hyperlipidemia NEC/NOS : 14254 Esophageal reflux : 11293
## (Other) :557030 (Other) : 449030
## NA's :979176 NA's :1115311
## ModeofPayment X.secondpayment Admissiontype
## Min. : 1.000 Min. : 1 Elective :323414
## 1st Qu.: 2.000 1st Qu.: 5 Emergency:586090
## Median : 3.000 Median : 6 NewBorn :177288
## Mean : 5.413 Mean : 6 Urgent :297601
## 3rd Qu.: 6.000 3rd Qu.: 8 NA's :261772
## Max. :99.000 Max. :10
## NA's :1228352
## SourceofAdmission
## Emergency :592054
## Physician referral :441439
## Other :194751
## Transfer from hospital: 44860
## Clinical referral : 22566
## (Other) : 28650
## NA's :321845
There were more females (971,461) than males (674,704), and Emergency was the most common admission type (586,090 admissions).
## [1] 5.1 4.1 4.1 2.1
## nhdsdatadf$Sex: Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.519 5.000 400.000
## --------------------------------------------------------
## nhdsdatadf$Sex: Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 5.072 6.000 561.000
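The per-sex summary above has the shape that base R's `by()` produces. A minimal sketch of how such a summary could be obtained, assuming a data frame named `nhdsdatadf` with `Sex` and `DaysofCare` columns; the toy values below are made up for illustration and are not NHDS records:

```r
# Toy stand-in for the real data; values are illustrative only.
nhdsdatadf <- data.frame(
  Sex = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  DaysofCare = c(1, 3, 5, 2, 4, 12)
)

# Min, quartiles, median, mean and max of DaysofCare, computed per sex.
days_by_sex <- by(nhdsdatadf$DaysofCare, nhdsdatadf$Sex, summary)
print(days_by_sex)

# Medians are less affected by a few very long stays than means are.
median_by_sex <- tapply(nhdsdatadf$DaysofCare, nhdsdatadf$Sex, median)
print(median_by_sex)
```

Comparing medians alongside means is useful here because days of care has a long right tail (a maximum of 561 days against a median of 3).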
Half of the discharges involved three or fewer days of care: the median was 3 days for both sexes.
Older adults made up a large share of the patients. The age distribution was bimodal, with the additional peak coming from admissions related to childbirth.
Routine was by far the most common discharge status, which this analysis reads as patients coming to the hospital for routine check-ups.
Interestingly, among patients less than one year old there were more non-newborns than newborns.
Whites were the predominant race in this dataset.
Most patients came from the Southern region, followed by the Midwest and the Northeast.
Among patients with a recorded marital status, Single (291,457) and Married (237,782) patients were the most numerous, followed by Widowed (77,999). Separated patients were the fewest (6,187), which might mean that separated people are healthier or more health-conscious in a preventive way.
Medicare was the most used payment option, followed by HMO/PPO.
Apart from Medicare, Blue Cross and private insurance covered the most patients.
Most discharges were from Charity hospitals.
Emergency was the most common source of admission, followed by physician referral.
Leaving childbirth-related codes aside, congestive heart failure (CHF) and pneumonia were the most frequent diagnoses. This suggests that heart-related problems were a leading reason for hospitalization in the US, at least between 2001 and 2005.
The secondary diagnosis codes suggest that atrial fibrillation frequently accompanied, and may have contributed to, congestive heart failure.
Hypertension was the next most common condition, which may explain why Routine shows such a high count.
Atherosclerosis (plaque forming in the blood vessels) was the next most frequent diagnosis.
Whites accounted for the largest percentage among the races.
The dataset contains 1,646,165 observations with 36 features, of which I omitted 10 for this analysis. It mainly describes the discharge status of US patients for the period 2001 to 2005. Diagnosis, race, and payment method were among the features of interest.
Other Observations:
## [1] 4.745727
## [1] 45.76869
## [1] 3
## [1] 48
## [1] 6.995645
## [1] 28.37315
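The first four values above match the mean and median of `DaysofCare` and `Age` in the summary table; reading the last two as standard deviations is an assumption. A minimal sketch of computing all three statistics in one pass, using a small made-up data frame (`toy` is a hypothetical name) rather than the real data:

```r
# Toy stand-in; the real analysis would use the full NHDS data frame.
toy <- data.frame(
  DaysofCare = c(1, 2, 3, 5, 14),
  Age = c(0, 24, 48, 71, 99)
)

# Mean, median and standard deviation for each numeric column.
# sapply() returns a matrix: one row per statistic, one column per variable.
stats <- sapply(toy, function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(round(stats, 2))
```

Keeping the statistics in one labelled matrix avoids the unlabelled `[1]` lines that appear when each value is printed separately.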
The main features in the dataset were Race, Sex, Age, Discharge Status, Diagnosis Code, Source of Admission, and Mode of Payment.
Weight and Geolocation were likely to contribute to the discharge status of the patients.
No, I have not created any new variables.
The Age feature showed a bimodal distribution. I did a lot of data cleaning, such as removing spaces and blanks, replacing diagnosis codes with descriptions based on ICD codes, and converting almost all features to factors.
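The cleaning steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used for the analysis: the column names, the raw values and the two-entry `icd_lookup` table are hypothetical.

```r
# Toy raw data with stray spaces, blanks and raw ICD-9-style codes.
raw <- data.frame(
  Sex = c(" Male", "Female ", "", "Female"),
  DiagCode = c("4280", "486", "4280", " "),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table from code to short description.
icd_lookup <- c("4280" = "CHF NOS", "486" = "Pneumonia, organism NOS")

clean <- raw
# Strip leading/trailing whitespace in every column, then turn blanks into NA.
clean[] <- lapply(clean, trimws)
clean[clean == ""] <- NA

# Replace codes with descriptions, then convert the columns to factors.
clean$Diagnosis <- factor(unname(icd_lookup[clean$DiagCode]))
clean$Sex <- factor(clean$Sex)
print(clean)
```

Converting blanks to `NA` before factoring matters: otherwise the empty string becomes a factor level of its own and shows up as a category in every table and plot.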
We wanted to check whether males or females stayed longer in hospitals. The boxplot for males was taller than the one for females, which suggests that males stayed longer. Most females stayed in the hospital for a relatively similar number of days.
Although there were more females in the dataset, males stayed longer than females. It was interesting to observe that men, though often considered physically stronger, took more time to recover than women. We also wanted to see whether the proportion of females remained higher when missing values were included, and as expected, females accounted for a higher proportion of hospital visits and stays.
We wanted to check the age range of hospital visits by gender while keeping NA values. Compared to females, males started visiting hospitals at a relatively younger age, and the median age for males was slightly higher than for females. In this view, males were visiting more than females.
Female discharges were higher between the ages of 20 and 35, most plausibly because of pregnancy. After age 50, males and females visited hospitals in roughly equal numbers. At later ages, however, the number of female patients rose again, which may indicate that females are more prone to various ailments.
We could observe that most of the diseases started in the late 30s. Many of the earlier cases are outliers that significantly change how we interpret the data, but we cannot deny that some people fall ill early. Interestingly, we could see pregnant women as old as 47.
The expectation was that days of care should increase with age, and the chart showed exactly that trend: as age advances, the number of days of stay also increases.
Apart from routine discharges, more females than males were transferred to a long-term facility.
##
## Alive Dead Medical Advice Routine long term stay
## Female 43728 13838 4872 623974 64676
## Male 33045 13205 7207 426061 38213
##
## short term stay
## Female 18426
## Male 17195
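Because the dataset contains more females than males, raw counts alone can overstate differences between the sexes. A small sketch that re-enters the counts from the table above and converts them to within-sex shares with `prop.table()`; the object names `counts` and `row_share` are chosen here for illustration:

```r
# Counts copied from the Sex by DischargeStatus table above.
counts <- matrix(
  c(43728, 13838, 4872, 623974, 64676, 18426,
    33045, 13205, 7207, 426061, 38213, 17195),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Sex = c("Female", "Male"),
    DischargeStatus = c("Alive", "Dead", "Medical Advice", "Routine",
                        "long term stay", "short term stay")
  )
)

# Share of each discharge status within each sex (each row sums to 1).
row_share <- prop.table(counts, margin = 1)
print(round(100 * row_share, 2))
```

On these counts, the long-term share is roughly 8.4% for females against 7.1% for males, so the difference holds after accounting for group size, though it is smaller than the raw counts suggest.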
Comparatively, physician referrals included more females than males.
##
## Female Male
## Clinical referral 14733 7833
## Court/law enforcement 1167 1571
## Emergency 322776 269278
## HMO referral 4131 1468
## Other 97577 97174
## Physician referral 299991 141448
## Transfer from hospital 22737 22123
## Transfer from other health facility 6562 5916
## Transfer from skilled nursing facility 4794 3041
Most of the Emergency cases were recorded between ages 60 and 75.
##
## MidWest NorthEast South West
## Female 277582 207066 351787 135026
## Male 194697 154068 237019 88920
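People in the Southern region appeared more prone to disease, since the South recorded the most discharges; the West recorded the fewest. These are raw counts and are not adjusted for regional population.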
More patients visited hospitals in the Southern region than in any other region. The question behind this chart was which region is more prone to disease, that is, where people fall ill most often.
Most people paid through Medicare, followed by HMO/PPO and private insurance.
The bimodal distribution for females came from pregnancy and regular check-ups. Men were visiting regularly from as early as age 35.
Black men appeared healthier than both White men and Black women.
Whites accounted for a higher percentage than any other race. This chart was meant to show the percentage of each race by source of admission. Among the sources of admission, Emergency and Physician referral were more common than any other.
##
## MidWest NorthEast South West
## American Indian/Alaskan Native 200 1189 2405 1466
## Asian 779 1692 2552 6496
## Black 51282 50878 114466 14152
## Multiple race 14 111 108 128
## Native Hawaiian or other Pacific Isldr 101 163 507 1590
## Other 4351 22201 34684 10558
## White 155328 245473 347152 116867
Whites in the Southern region fell ill more often than any other race in any region. The chart was drawn to understand the region-wise percentage of the different races.
Comparatively, Whites were more prone to ill health than any other race. We omitted the other races because their counts were negligible. The expectation was that the percentage of Whites should exceed that of the other races, supporting the previous claims.
The expectation from this chart was that the percentage of females should exceed that of males, supporting the previous claims.
Admissions were higher in Charity hospitals than in any other hospital type. We wanted to see which type of hospital served the most people, given that most people pay through Medicare. So the question now is: are the charity hospitals really doing charity?
The dataset I used was limited to five years of data. Interestingly, there was a relationship between race and location. Whites appeared highly prone to heart problems, as those were the top diagnosis codes for which Whites saw a doctor. We could also observe that Whites saw a doctor regularly from as early as age 35.
A lot of charity activity was going on in the Southern region compared to other regions. It could be inferred that the Midwest and South had by far the largest numbers of charity hospitals. Though the Midwest had slightly more hospitals than the South, the South recorded more charity discharges than the Midwest. This suggests that the Southern region was either underdeveloped or very wealthy, since many people there may have started charity organizations to support fellow citizens. However, the South also had a very high number of proprietary hospitals, which points to the presence of an affluent class that can afford them.
The strongest relationship I found was Race versus Location. This might partly be due to the White population being larger in the region. However, taking the White population to be almost double the Black population, hospital visits were not in proportion to population. So Whites appeared more prone to disease than Blacks.
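One way to put a number on the strength of the Race versus Location relationship is a chi-squared test together with Cramér's V. The sketch below re-enters the counts from the Race by region table shown earlier; it treats every discharge as an independent observation and ignores survey weights, which is a simplifying assumption, and the object names are chosen here for illustration.

```r
# Counts copied from the Race by region table shown earlier.
race_region <- matrix(
  c(   200,   1189,   2405,   1466,
       779,   1692,   2552,   6496,
     51282,  50878, 114466,  14152,
        14,    111,    108,    128,
       101,    163,    507,   1590,
      4351,  22201,  34684,  10558,
    155328, 245473, 347152, 116867),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    Race = c("American Indian/Alaskan Native", "Asian", "Black",
             "Multiple race", "Native Hawaiian or other Pacific Isldr",
             "Other", "White"),
    Region = c("MidWest", "NorthEast", "South", "West")
  )
)

# Pearson chi-squared test of independence between race and region.
test <- chisq.test(race_region)

# Cramer's V rescales the statistic to the 0..1 range, so a huge sample
# does not automatically look like a strong association.
n <- sum(race_region)
cramers_v <- sqrt(as.numeric(test$statistic) / (n * (min(dim(race_region)) - 1)))
print(cramers_v)
```

With more than a million rows almost any difference is statistically significant, so the effect size is the more informative figure when comparing this relationship against others in the dataset.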
From the heat map we could see that the code Deliver-single liveborn appeared from the early teenage years. Congestive heart failure (CHF) was observed at a relatively young age of around 35.
## [1] 11
## [1] 58
The minimum age of females admitted for delivery was 11 and the maximum was 58.
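A minimal sketch of how such an age range could be extracted, using made-up rows. Filtering on `UnitsofAge == 1` assumes that code 1 means the age is recorded in years, which should be checked against the NHDS documentation; `toy`, `delivery` and `age_range` are hypothetical names.

```r
# Toy rows; values are illustrative only.
toy <- data.frame(
  Sex = c("Female", "Female", "Female", "Male", "Female"),
  UnitsofAge = c(1, 1, 1, 1, 1),
  Age = c(11, 27, 58, 40, 33),
  Diagnosis.Code.1 = c("Deliver-single liveborn", "Deliver-single liveborn",
                       "Deliver-single liveborn", "CHF NOS",
                       "Pneumonia, organism NOS")
)

# Keep female delivery records whose age is expressed in years (assumed code 1).
delivery <- subset(
  toy,
  Sex == "Female" & UnitsofAge == 1 &
    Diagnosis.Code.1 == "Deliver-single liveborn"
)

# range() returns the minimum and maximum in one call.
age_range <- range(delivery$Age)
print(age_range)
```

Extreme values such as these are worth inspecting row by row, since a mis-coded age unit or diagnosis would produce the same result.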
## Crnry athrscl natve vssl Single lb in-hosp w/o cs Obs chr bronc w(ac) exac
## 202 235 363
## Subendo infarct, initial CHF NOS Pneumonia, organism NOS
## 900 1208 1559
The output above lists the most frequent diagnosis codes among patients discharged as Dead. Patients with these codes were mostly admitted under the Emergency category.
We could see that infant boys died more often than infant girls.
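The ascending counts in the output above are consistent with sorting a frequency table and keeping its tail. A small sketch of that pattern on made-up rows (the data frame and the cut-off of two codes are illustrative assumptions):

```r
# Toy rows; values are illustrative only.
toy <- data.frame(
  DischargeStatus = c("Dead", "Dead", "Dead", "Routine",
                      "Dead", "Dead", "Routine", "Dead"),
  Diagnosis.Code.1 = c("Pneumonia, organism NOS", "CHF NOS",
                       "Pneumonia, organism NOS", "CHF NOS",
                       "Subendo infarct, initial", "Pneumonia, organism NOS",
                       "Pneumonia, organism NOS", "CHF NOS")
)

# Restrict to patients discharged as Dead.
dead <- toy[toy$DischargeStatus == "Dead", ]

# Count each primary diagnosis, sort ascending, keep the most frequent two.
top_codes <- tail(sort(table(dead$Diagnosis.Code.1)), 2)
print(top_codes)
```

Dividing these counts by the total number of discharges per code would give a death rate per diagnosis, which separates codes that are merely common from codes that are actually more dangerous.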
## [1] 5
## [1] 76
## [1] 9.126502
## [1] 69.99652
## [1] 20.6802
## [1] 14.17939
## Atrial fibrillation Subendo infarct, initial Obs chr bronc w(ac) exac
## 9679 10165 11727
## Crnry athrscl natve vssl CHF NOS Pneumonia, organism NOS
## 12847 26166 26400
Whites were using Proprietary hospitals even though Charity and Government hospitals were available. At the same time, Whites visited hospitals more than any other race. Though there were more charity hospitals in the Southern region, people used charity hospitals more in the Northeast than in the South. This might be due to a larger affluent class in the South compared to other regions.
The age range for men was wider than for women, so men might be suffering from a greater variety of ailments. We could also see that, except in the Midwest, men were prone to disease from childhood.
Men were more prone to ailments at an early age than women. We could also observe a difference of nearly 10 years between men's and women's life spans.
We could see a dip in Medical Advice discharges for people in the Northeast, from which we could infer that they visit hospitals at later stages of disease. The graph also suggested that people from the Northeast were discharged as Dead more than those from any other region. Either facilities were lacking, or people visited hospitals at stages where the ailment was no longer curable even after transfer to a long-term facility; the graph showed that people from this region were transferred to long-term facilities more often.
##
## MidWest NorthEast South West
## Alive 19975 15978 32448 8372
## Dead 7581 6297 9883 3282
## Medical Advice 2504 4029 4257 1289
## Routine 293772 217138 387703 151422
## long term stay 34756 26954 29587 11592
## short term stay 10317 8095 13131 4078
However, the statistics tell a different story, so missing values were playing an important role in this particular analysis. We could draw more accurate inferences by checking data for a few more years before coming to any conclusion.
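The effect of missing values can be checked directly by computing the same statistic on all rows and on complete rows only. A minimal sketch with made-up data, where `toy` and the pattern of `NA` values are purely illustrative:

```r
# Toy rows; Race is missing for two of the three females.
toy <- data.frame(
  Sex = c("Female", "Female", "Male", "Female", "Male", "Male"),
  Race = c("White", NA, "Black", NA, "White", "White")
)

# Share of each sex using every row.
share_all <- prop.table(table(toy$Sex))

# Share of each sex after dropping rows with any missing value.
complete <- na.omit(toy)
share_complete <- prop.table(table(complete$Sex))

print(share_all)
print(share_complete)
```

When the two results differ noticeably, the values are not missing at random with respect to that variable, and conclusions drawn from complete rows alone should be stated with that caveat.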
This chart confirmed that Routine was the most common discharge status. Deliveries of boys were also slightly more numerous than deliveries of girls.
Women appeared healthier than men, and White people appeared more prone to ailments than any other race. The chart also suggested that the Southern region has more hospitals than the other regions.
The Southern region also had more proprietary hospitals than other regions, which suggests a larger affluent class there. So we could tentatively correlate more money with more ailments.
Women lived longer than men. Applying a log scale to the y-axis showed even more detail. More women than men were discharged from hospitals. Though the dataset contains more women, the recorded data for men showed the health trend among men across all US regions. We could also observe that baby boys were more prone to death than baby girls.
Most people at later ages were sent to long-term facilities. This means good facilities were available to treat deadly diseases. The median age for the Alive discharge status was around 60, and the median for Dead was around 65. Routine check-ups started as early as the late 30s (probably for diabetes).
## : American Indian/Alaskan Native
## : Alive
## [1] 61
## --------------------------------------------------------
## : Asian
## : Alive
## [1] 68
## --------------------------------------------------------
## : Black
## : Alive
## [1] 60
## --------------------------------------------------------
## : Multiple race
## : Alive
## [1] 61.5
## --------------------------------------------------------
## : Native Hawaiian or other Pacific Isldr
## : Alive
## [1] 59
## --------------------------------------------------------
## : Other
## : Alive
## [1] 57
## --------------------------------------------------------
## : White
## : Alive
## [1] 72
## --------------------------------------------------------
## : American Indian/Alaskan Native
## : Dead
## [1] 65
## --------------------------------------------------------
## : Asian
## : Dead
## [1] 77
## --------------------------------------------------------
## : Black
## : Dead
## [1] 67
## --------------------------------------------------------
## : Multiple race
## : Dead
## [1] 77.5
## --------------------------------------------------------
## : Native Hawaiian or other Pacific Isldr
## : Dead
## [1] 73
## --------------------------------------------------------
## : Other
## : Dead
## [1] 65
## --------------------------------------------------------
## : White
## : Dead
## [1] 77
## --------------------------------------------------------
## : American Indian/Alaskan Native
## : Medical Advice
## [1] 43
## --------------------------------------------------------
## : Asian
## : Medical Advice
## [1] 50
## --------------------------------------------------------
## : Black
## : Medical Advice
## [1] 43
## --------------------------------------------------------
## : Multiple race
## : Medical Advice
## [1] 63
## --------------------------------------------------------
## : Native Hawaiian or other Pacific Isldr
## : Medical Advice
## [1] 46
## --------------------------------------------------------
## : Other
## : Medical Advice
## [1] 41
## --------------------------------------------------------
## : White
## : Medical Advice
## [1] 44
## --------------------------------------------------------
## : American Indian/Alaskan Native
## : Routine
## [1] 32
## --------------------------------------------------------
## : Asian
## : Routine
## [1] 34
## --------------------------------------------------------
## : Black
## : Routine
## [1] 38
## --------------------------------------------------------
## : Multiple race
## : Routine
## [1] 38
## --------------------------------------------------------
## : Native Hawaiian or other Pacific Isldr
## : Routine
## [1] 33
## --------------------------------------------------------
## : Other
## : Routine
## [1] 26
## --------------------------------------------------------
## : White
## : Routine
## [1] 43
## --------------------------------------------------------
## : American Indian/Alaskan Native
## : long term stay
## [1] 70
## --------------------------------------------------------
## : Asian
## : long term stay
## [1] 78
## --------------------------------------------------------
## : Black
## : long term stay
## [1] 74
## --------------------------------------------------------
## : Multiple race
## : long term stay
## [1] 78
## --------------------------------------------------------
## : Native Hawaiian or other Pacific Isldr
## : long term stay
## [1] 71
## --------------------------------------------------------
## : Other
## : long term stay
## [1] 74
## --------------------------------------------------------
## : White
## : long term stay
## [1] 80
## --------------------------------------------------------
## : American Indian/Alaskan Native
## : short term stay
## [1] 63.5
## --------------------------------------------------------
## : Asian
## : short term stay
## [1] 70
## --------------------------------------------------------
## : Black
## : short term stay
## [1] 57
## --------------------------------------------------------
## : Multiple race
## : short term stay
## [1] 57
## --------------------------------------------------------
## : Native Hawaiian or other Pacific Isldr
## : short term stay
## [1] 44
## --------------------------------------------------------
## : Other
## : short term stay
## [1] 50
## --------------------------------------------------------
## : White
## : short term stay
## [1] 67
Crnry athrscl natve vssl (coronary atherosclerosis of a native vessel) was more common among men. Pneumonia was also among the top diseases and affected men more than women. Treating it as a pollution-related ailment, we could infer that more men than women were working, because working men were more exposed to pollution outdoors. Another interesting finding is that more women than men were discharged with a Rehabilitation procedure.
The NHDS dataset contains 1,646,165 observations from 2001 to 2005. I started by cleaning the data and converting the features to factors, and I mapped all the diagnosis codes to short descriptions using scripts. All NA values were omitted from the analysis. The relationships between the variables did differ between the dataset with missing values and the dataset without them. As this is survey data, the outliers were assumed to be natural.
I then explored each variable in the dataset and moved on to questions such as which gender was more likely to see a doctor, apart from deliveries for women. Different types of charts show different perspectives, and it was hard to pick the ones that best show the relationships between the features. I also consulted a coach from Udacity, and after the discussion I realized that I had picked a really complex dataset. I followed the coach's suggestions, such as trying to identify the correlation between diagnosis codes, discharge status, and sex.
I could see a strong relationship between race and diagnosis code, so I explored further to look for any demographic connection to the diagnosis codes. The population consisted mostly of Whites and Blacks, and the proportion of ailments differed considerably between the two groups: Blacks were underrepresented among the ailments relative to their population, whereas Whites had a much higher proportion. I also searched online about the US regions to cross-check my findings against population and socio-economic conditions.
Diabetes, for which the US is notorious, was not among the top 5 diagnosis codes; it might have been covered under routine check-ups, which is consistent with Routine having the highest count. A more recent dataset would have been better for reaching a solid conclusion.
The main idea for future work is to take the data for Whites alone (as they are the largest group in the dataset), together with more years of data, and find out what is really affecting them.